Transferring Visual Attributes from Natural Language to Verified Image Generation
Text-to-image (T2I) generation methods are widely popular for generating art
and other creative artifacts. While visual hallucinations can be a positive
factor in scenarios where creativity is appreciated, such artifacts are poorly
suited for cases where the generated image needs to be grounded in complex
natural language without explicit visual elements. In this paper, we propose to
strengthen the consistency of T2I methods in the presence of complex natural
language, which often exceeds the limits of T2I methods by including non-visual
information and textual elements that require world knowledge for accurate
generation. To address these phenomena, we propose a Natural Language to
Verified Image generation approach (NL2VI) that converts a natural prompt into
a visual prompt, which is better suited for image generation. A T2I model then
generates an image from the visual prompt, and the image is verified with VQA
algorithms. Experimentally, aligning natural prompts with image generation
improves the consistency of the generated images by up to 11% over the state of
the art. Moreover, the improvements generalize to challenging domains such as
cooking and DIY tasks, where the correctness of the generated image is crucial
for illustrating actions.
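The generate-then-verify loop described above lends itself to a compact sketch. Below is a minimal, hypothetical Python rendering of such a pipeline, with the prompt rewriter, T2I model, question generator, and VQA model injected as callables; all names, the retry budget, and the exact-match pass criterion are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List, Tuple


def nl2vi_pipeline(
    natural_prompt: str,
    to_visual_prompt: Callable[[str], str],       # rewrites natural language into a visual prompt
    generate_image: Callable[[str], object],      # any T2I model
    make_qa_checks: Callable[[str], List[Tuple[str, str]]],  # (question, expected answer) pairs
    answer_vqa: Callable[[object, str], str],     # any VQA model
    max_tries: int = 3,
) -> Tuple[object, bool]:
    """Convert a natural prompt into a visual prompt, generate an image,
    and accept it only if every VQA check returns the expected answer."""
    visual_prompt = to_visual_prompt(natural_prompt)
    checks = make_qa_checks(visual_prompt)
    image = None
    for _ in range(max_tries):
        image = generate_image(visual_prompt)
        if all(answer_vqa(image, q).strip().lower() == a.strip().lower()
               for q, a in checks):
            return image, True   # verified: all checks passed
    return image, False          # budget exhausted: return last attempt, unverified
```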
On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method
Most work on modeling the conversation history in Conversational Question Answering (CQA) reports a single main result on a common CQA benchmark. While existing models show impressive results on CQA leaderboards, it remains unclear whether they are robust to shifts in setting (sometimes to more realistic ones), training data size (e.g., from large to small sets) and domain. In this work, we design and conduct the first large-scale robustness study of history modeling approaches for CQA. We find that high benchmark scores do not necessarily translate to strong robustness, and that various methods can perform extremely differently under different settings. Equipped with the insights from our study, we design a novel prompt-based history modeling approach and demonstrate its strong robustness across various settings. Our approach is inspired by existing methods that highlight historic answers in the passage. However, instead of highlighting by modifying the passage token embeddings, we add textual prompts directly in the passage text. Our approach is simple, easy to plug into practically any model, and highly effective, thus we recommend it as a starting point for future model developers. We also hope that our study and insights will raise awareness of the importance of robustness-focused evaluation, in addition to obtaining high leaderboard scores, leading to better CQA systems.
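The key mechanism, adding textual prompts directly in the passage text rather than modifying token embeddings, can be illustrated with a short sketch. The marker wording and span representation below are assumptions for illustration; the paper's exact prompt format may differ.

```python
from typing import List, Tuple


def mark_history_in_passage(passage: str, answer_spans: List[Tuple[int, int]]) -> str:
    """Wrap previous-turn answer spans in textual markers inside the passage.

    answer_spans: (start, end) character offsets of answers from earlier turns,
    assumed non-overlapping and given in passage order.
    """
    pieces, last = [], 0
    for turn, (start, end) in enumerate(answer_spans, 1):
        pieces.append(passage[last:start])
        pieces.append(f"<answer {turn}> {passage[start:end]} </answer {turn}>")
        last = end
    pieces.append(passage[last:])
    return "".join(pieces)


# Example: the model now sees the history inline, as ordinary text.
print(mark_history_in_passage("The cat sat on the mat.", [(4, 7)]))
# -> "The <answer 1> cat </answer 1> sat on the mat."
```

Because the history signal is plain text rather than a change to the embedding layer, it can be plugged into practically any reader model without architectural changes.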
What You See is What You Read? Improving Text-Image Alignment Evaluation
Automatically determining whether a text and a corresponding image are
semantically aligned is a significant challenge for vision-language models,
with applications in generative text-to-image and image-to-text tasks. In this
work, we study methods for automatic text-image alignment evaluation. We first
introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets
from both text-to-image and image-to-text generation tasks, with human
judgements for whether a given text-image pair is semantically aligned. We then
describe two automatic methods to determine alignment: the first involving a
pipeline based on question generation and visual question answering models, and
the second employing an end-to-end classification approach by finetuning
multimodal pretrained models. Both methods surpass prior approaches in various
text-image alignment tasks, with significant improvements in challenging cases
that involve complex composition or unnatural images. Finally, we demonstrate
how our approaches can localize specific misalignments between an image and a
given text, and how they can be used to automatically re-rank candidates in
text-to-image generation.
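The first of the two methods, a question generation plus VQA pipeline, can be sketched as a simple scoring function. Everything below (the function names, exact-match scoring, and uniform averaging) is an illustrative assumption; the paper's models and aggregation may differ.

```python
from typing import Callable, List, Tuple


def alignment_score(
    text: str,
    image: object,
    gen_qa_pairs: Callable[[str], List[Tuple[str, str]]],  # question generation model
    vqa: Callable[[object, str], str],                     # visual question answering model
) -> float:
    """Score text-image alignment as the fraction of questions generated from
    the text that the VQA model answers as expected."""
    pairs = gen_qa_pairs(text)
    if not pairs:
        return 0.0
    hits = sum(
        vqa(image, question).strip().lower() == expected.strip().lower()
        for question, expected in pairs
    )
    return hits / len(pairs)
```

A per-question breakdown of the same loop is what allows the approach to localize specific misalignments rather than only produce a single score.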
MaXM: Towards Multilingual Visual Question Answering
Visual Question Answering (VQA) has been primarily studied through the lens
of the English language. Yet, tackling VQA in other languages in the same
manner would require a considerable amount of resources. In this paper, we
propose scalable solutions to multilingual visual question answering (mVQA), on
both data and modeling fronts. We first propose a translation-based framework
to mVQA data generation that requires much less human annotation effort than
the conventional approach of directly collecting questions and answers. Then,
we apply our framework to the multilingual captions in the Crossmodal-3600
dataset and develop an efficient annotation protocol to create MaXM, a
test-only VQA benchmark in 7 diverse languages. Finally, we develop a simple,
lightweight, and effective approach as well as benchmark state-of-the-art
English and multilingual VQA models. We hope that our benchmark encourages
further research on mVQA.
Comment: EMNLP 2023 (Findings).
https://github.com/google-research-datasets/max
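The translation-based generation step can be sketched as follows; the target-language codes and record format below are placeholders, and the human annotation protocol the paper applies afterwards is not modeled here.

```python
from typing import Callable, Dict, List

TARGET_LANGS = ["fr", "hi", "th"]  # placeholder codes; MaXM covers 7 diverse languages


def translate_vqa_data(
    english_qas: List[Dict[str, str]],      # e.g. [{"question": ..., "answer": ...}]
    translate: Callable[[str, str], str],   # (text, target_lang) -> translated text
) -> Dict[str, List[Dict[str, str]]]:
    """Generate candidate mVQA examples by translating English QA pairs into
    each target language; candidates would then pass through human annotation."""
    candidates = {lang: [] for lang in TARGET_LANGS}
    for qa in english_qas:
        for lang in TARGET_LANGS:
            candidates[lang].append({
                "question": translate(qa["question"], lang),
                "answer": translate(qa["answer"], lang),
            })
    return candidates
```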
Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning
Despite recent advances in natural language understanding and generation, and
decades of research on the development of conversational bots, building
automated agents that can carry on rich open-ended conversations with humans
"in the wild" remains a formidable challenge. In this work we develop a
real-time, open-ended dialogue system that uses reinforcement learning (RL) to
power a bot's conversational skill at scale. Our work pairs the succinct
embedding of the conversation state generated using SOTA (supervised) language
models with RL techniques that are particularly suited to a dynamic action
space that changes as the conversation progresses. Trained using crowd-sourced
data, our system substantially exceeds the (strong) baseline supervised model
with respect to several metrics of interest in a live experiment with real
users of the Google Assistant.
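The dynamic-action-space idea can be illustrated with a generic sketch: each turn, score only the currently available candidate actions against the embedded conversation state. The epsilon-greedy policy and all names below are assumptions for illustration, not the paper's algorithm.

```python
import random
from typing import Callable, List, Sequence


def select_turn_action(
    state_embedding: Sequence[float],        # conversation state from a supervised LM
    candidate_actions: List[str],            # the action set changes every turn
    q_value: Callable[[Sequence[float], str], float],  # learned state-action value
    epsilon: float = 0.1,
) -> str:
    """Epsilon-greedy choice over the per-turn candidate set, so the policy
    never needs a fixed global action space."""
    if random.random() < epsilon:
        return random.choice(candidate_actions)  # explore
    return max(candidate_actions, key=lambda a: q_value(state_embedding, a))  # exploit
```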